Abstract. A traffic performance measurement system, PeMS, currently
functions as a statewide repository for traffic data gathered by thousands
of automatic sensors. It has integrated data collection, processing
and communications infrastructure with data storage and analytical 
tools. In this paper, we discuss statistical issues that have emerged as 
we attempt to process a data stream of 2 GB per day of wildly varying 
quality. In particular, we focus on detecting sensor malfunction, impu- 
tation of missing or bad data, estimation of velocity and forecasting of 
travel times on freeway networks. 

Key words and phrases: ATIS, freeway loop data, speed estimation, 
malfunction detection. 
1. INTRODUCTION

As vehicular traffic congestion has increased, es- 
pecially in urban areas, so have efforts at data col- 
lection, analysis and modeling. This paper discusses 
the statistical aspects of a particular effort, the Free- 
way Performance Measurement System (PeMS). We 
begin this introduction with some general discussion
of data collection and traffic modeling and then de- 
scribe PeMS. 

1.1 Data Collection and Traffic Modeling 

Traffic data are collected by three types of sen- 
sors. The first type is a point sensor, which provides 
estimates of flow or volume, occupancy and speed at 
a particular location on the freeway, averaged over 
30 seconds. Ninety percent of point sensors are in- 
ductive loops buried in the pavement; the others are 
overhead video cameras or side-fired radar detectors.
Point sensors provide continuous measurement. The 
large amount of data they provide can be used for 
statistical analysis. 

The second type of sensor is the floating car, which
records GPS or tachometer readings from which one
can construct the vehicle trajectory.
Floating cars are expensive since they require
drivers. Departments of Transportation (DoTs) typ- 
ically deploy floating cars once or twice a year on 
stretches of freeway that are congested to determine 
travel time and the extent of the freeway that is 
congested. The data are insufficient for reliable es- 
timates of travel time variability. 

The third type of sensor can be used in areas in 
which vehicles are equipped with RFID tags. These 
tags are used for electronic toll collection (ETC). 
In the San Francisco Bay Area, for example, ETC 
tags are used for bridge toll collection. ETC readers 
are deployed at several locations, in addition to the 
bridge toll booths. These readers collect the tag ID
and add a time stamp. By matching these at two 
consecutive reader locations, one gets the vehicle's 
travel time between the two locations. (One may 
view these data as samples of floating car trajec- 
tories.) The www.511.org site displays travel times 
estimated using these data. Of course, this type of 
sensor can only be deployed in a few locations. More- 
over, the penetration of ETC tags in the whole ve- 
hicle population, and hence the data they provide, 
varies by time of day and day of week. 
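
The tag-matching step described above can be sketched in a few lines. This is only an illustration, assuming each reader yields (tag ID, timestamp in seconds) pairs; the function name `match_travel_times` and the record layout are hypothetical and are not taken from any ETC system discussed here.

```python
def match_travel_times(upstream, downstream):
    """Return {tag_id: travel_time_seconds} for tags seen at both readers.

    upstream, downstream: iterables of (tag_id, timestamp_seconds).
    Assumption: if a tag is read more than once upstream, the earliest
    reading is used; downstream readings earlier than it are ignored.
    """
    first_seen = {}
    for tag, ts in upstream:
        if tag not in first_seen or ts < first_seen[tag]:
            first_seen[tag] = ts

    times = {}
    for tag, ts in downstream:
        start = first_seen.get(tag)
        # Keep only matches where the downstream reading is later.
        if start is not None and ts > start:
            travel = ts - start
            if tag not in times or travel < times[tag]:
                times[tag] = travel
    return times


# Hypothetical readings: tag Z99 is never seen upstream, C03 never downstream.
upstream = [("A17", 0.0), ("B42", 12.0), ("C03", 30.0)]
downstream = [("B42", 312.0), ("A17", 290.0), ("Z99", 400.0)]
travel_times = match_travel_times(upstream, downstream)
```

Only vehicles carrying a tag and passing both readers contribute, which is why the sample size varies with tag penetration.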

In addition, there are special data sets obtained 
from surveys. 

Point sensors implemented by inductive loops pro- 
vide 95% of the data used by DoTs and traffic an- 
alysts worldwide. These data are used for two pur- 
poses: real-time traffic control and building traffic 
flow models for planning. 

The primary traffic control mechanism is ramp 
metering, which controls the volume of traffic that 
enters the freeway at an on-ramp. The rate of flow 
depends on the density of traffic on the freeway, 
estimated from real-time loop data. Measurement, 
modeling and control are discussed in Papageorgiou 
(1983) and Papageorgiou et al. (1990), for example. 

Real-time and historical data are also used to es- 
timate and predict travel times. Travel time pre- 
dictions are posted on the web and on changeable 
message signs on the side of the freeway. Attempts 
to process these data to estimate the occurrence of 
an accident have been unsuccessful, because of high 
false alarm rates. 

Simulation models are used by regional transporta- 
tion planners to predict changes in the pattern of 
traffic through a freeway network as a result of pro- 
jected increase in demand or the addition of a lane 
or extension of a highway. The models are more fre- 
quently used to predict the impact of proposed shop- 
ping or housing development, or, in an operational 
context, to compare different alternatives to relieve 
congestion at some location. Microscopic models, 
such as TSIS/CORSIM, TRANSIMS, VISSIM and 
Paramics, predict the movement of each individual
vehicle. In macroscopic models, such as TRANSYT, 
SYNCHRO and DYNASMART, the unit of analy- 
sis is a platoon of vehicles or macroscopic variables 
such as flow, density and speed. URLs for these sim- 
ulation models are given in the list of references. A 
fascinating overview and discussion of microscopic 
and macroscopic traffic models is provided by Hel- 
bing (2001). 



Microscopic models are based on car-following and
gap-acceptance models of driver behavior: how closely 
do drivers follow the car in front as a function of 
distance and relative speed; and how big a gap is 
needed before drivers change lanes. The parame- 
ters in these behavioral models are interpreted as 
indicators of driver aggressiveness and impatience. 
Microscopic models have scores of parameters, but 
they are calibrated using aggregate point detector 
data. As a result, most parameters are simply set 
to default values and no attempt is made to esti- 
mate them. Macroscopic models have fewer param- 
eters, which can be estimated with point detector 
data. Typically, however, the estimates are based 
on least squares fit using a few days of data, with 
no attempt to calculate the reliability of the esti- 
mates. In order to predict network-wide traffic flows, 
the models need origin-destination flow data. These 
are converted into link-level flows assuming some 
kind of user equilibrium in which drivers take routes 
that have minimum travel times. Since these travel 
times depend on the link flows themselves, an itera- 
tive procedure is needed to calculate the assignment 
of origin-destination flows to link flows (Yu et al., 
2004). Origin-destination flow data themselves are 
based on survey data or they are inferred from ac- 
tivity models that relate employment and household 
location data, obtained from the Census. 

1.2 The Freeway Performance Measurement System

Over a number of years, the State of California has 
invested in developing Transportation Management 
Centers (TMCs) in urban areas to help manage traffic.
The TMCs receive traffic measurements from the
field, such as average speed and volume. These data, 
which are updated every 30 seconds, help the oper- 
ations staff react to traffic conditions, to minimize 
congestion and to improve safety. 

More recently, the California Department of Transportation
(Caltrans) recognized that the data collected
by the TMCs are valuable beyond real-time
operations needs, and a concept of a central data 
repository and analysis system evolved. Such a sys- 
tem would provide the data to transportation stake- 
holders at all jurisdictional levels. It was decided to 
pursue this concept at a research level before invest- 
ing significant resources. Thus, a collaboration be- 
tween Caltrans and PATH (Partners for Advanced 
Transit and Highways) at the University of Califor- 
nia at Berkeley was initiated to develop a perfor- 
mance measurement system or PeMS. 
PeMS currently functions as a statewide repository
for traffic data gathered by thousands of automatic
sensors. It has integrated existing Caltrans
data collection, processing and communications infrastructure
with data storage and analytical tools.
Through the Internet (http://pems.eecs.berkeley.edu),
PeMS provides immediate access to the data
to a wide variety of users. The system supports stan- 
dard Internet browsers, such as Netscape or Ex- 
plorer, so that users do not need any specialized 
software. In addition, PeMS provides simple plot- 
ting and analysis tools to facilitate standard engi- 
neering and planning tasks and help users interpret 
the data. 

PeMS has many different users. Operational traffic
engineers need the latest measurements to base
their decisions on the current state of the freeway 
network. For example, traffic control equipment, such 
as ramp-metering and changeable message signs, must 
be optimally placed and evaluated. Caltrans man- 
agers want to quickly obtain a uniform and com- 
prehensive assessment of the performance of their 
freeways. Planners look for long-term trends that 
may require their attention; for example, they try to 
determine whether congestion bottlenecks can be al- 
leviated by improving operations or by minor capital 
improvements. They conduct freeway operational 
analyses, bottleneck identification, assessment of in- 
cidents and evaluation of advanced control strate- 
gies, such as on-ramp metering. Individual travelers 
and fleet operators want to know current shortest
routes and travel time estimates. Researchers use 
the data to study traffic dynamics and to calibrate 
and validate simulation models. PeMS can serve to 
guide development and assess deployment of intelli- 
gent transportation systems (ITS). 

PeMS has many different faces, but at some level 
it is just a simple balance sheet. A transportation 
system consumes public resources. In return, it pro- 
duces transportation services that move people and 
goods. PeMS provides an automated system to ac- 
count for these outputs and inputs through a collec- 
tion of accounting formulas that aggregate received 
data into meaningful indicators. This produces a 
balance sheet for use in tracking performance over 
time and across agencies in a reasonably objective 
manner. Examples of "meaningful indicators" are: 

• hourly, daily, weekly totals of VMT (vehicle-miles
traveled), VHT (vehicle-hours traveled) and travel
time for selected routes or freeway segments (links),

• means and variances of VMT, VHT and travel
time.

These are simple measures of the volume, quality 
and reliability of the output of highway links. Pub- 
lication each day of these numbers tells drivers and 
operators how well those links are functioning. Time 
series plots can be used to gauge monthly, weekly, 
daily and hourly trends. 
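
As an illustration of such an accounting formula, the sketch below aggregates VMT and VHT for one link. It assumes the usual definitions (VMT as vehicle count times link length, VHT as VMT divided by speed), which are not spelled out in the text; the function name `link_vmt_vht` and the sample numbers are hypothetical.

```python
def link_vmt_vht(counts, speeds_mph, link_length_miles):
    """Aggregate VMT and VHT for one link over a list of sample intervals.

    counts: vehicles counted in each interval.
    speeds_mph: average speed in each interval (mph), same length as counts.
    Assumption: every counted vehicle traverses the whole link at the
    interval's average speed. Intervals with zero speed add to VMT but
    are skipped for VHT in this sketch.
    """
    vmt = 0.0
    vht = 0.0
    for n, v in zip(counts, speeds_mph):
        miles = n * link_length_miles
        vmt += miles
        if v > 0:
            vht += miles / v
    return vmt, vht


# Three intervals on a half-mile link; the middle one is congested.
vmt, vht = link_vmt_vht([100, 120, 80], [60.0, 30.0, 60.0], 0.5)
```

Here the congested interval carries 40% of the VMT but more than half of the VHT, which is the kind of contrast the balance sheet is meant to expose.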

Every 30 seconds, PeMS receives detector data 
over the Caltrans wide area network (WAN) to which 
all 12 districts are connected. Each individual Cal- 
trans district is connected to PeMS through the WAN 
over a permanent ATM virtual circuit. A front end 
processor (FEP) at each district receives data from 
freeway loops every 30 seconds. The FEP formats 
these data and writes them into the TMC database, 
as well as into the PeMS database. PeMS maintains 
a separate instance of the database for each district. 
Although the table formats vary slightly across dis- 
tricts, they are stored in PeMS in a uniform way, so 
the same software works for all districts. 

The PeMS computer at UC Berkeley is a four- 
processor SUN 450 workstation with 1 GB of RAM 
and 2 terabytes of disk. It uses a standard Ora- 
cle database for storage and retrieval. The mainte- 
nance and administration of the database is stan- 
dard but highly specialized work, which includes 
disk management, crash recovery and table config- 
uration. Also, many parameters must be tuned to 
optimize database performance. A part-time Oracle 
database administrator is necessary. 

The PeMS database architecture is modular and 
open. A new district can be added online with six 
person-weeks of effort, with no disruption of the dis- 
trict's TMC. Data from new loops can be incorpo- 
rated as they are deployed. New applications are 
added as need arises. 

PeMS includes software serving three main func- 
tions: operating the database, processing and ana- 
lyzing the data, and providing access to the data via 
the Internet. The processing of the data is done to 
ensure their reliability. It is a fact of life that the 
automatic detectors that generate most of our data 
are prone to malfunction. Detecting malfunction in 
an array of correlated sensors has been a statistical 
challenge. The related problem of imputation of bad 
or missing values is another major concern. 

PeMS provides access to the database through the 
Internet. Using a standard browser such as Netscape 
or Internet Explorer, the user is able to query the 
database in a variety of ways. He or she can use
built-in tools to plot the query results, or down- 
load the data for further study. Numerous tools for 
visualization are provided, allowing users to exam- 
ine a variety of phenomena. Visualization tools in- 
clude real-time maps showing levels of congestion, 
flow and speed profiles in space and in time, time 
series for individual detectors, plots displaying de- 
tector health, profiles of incidents in space and time, 
graphics to aid in the identification of bottlenecks, 
displays of delay as a function of space and time, 
and graphical summaries of vehicle miles traveled 
by freeway segment as a function of space and time. 

In this paper we will describe how PeMS works. 
Our emphasis will be on the statistical issues that 
have emerged as we attempt to process a data stream 
of 2 GB per day of wildly varying quality. Real- 
time processing of the data is essential and while 
our methods cannot be optimal or "best" in any 
statistical sense, we aim for them to be as "good" 
as possible under the circumstances, and improvable 
over time. 

The remainder of the paper is organized as fol- 
lows. In Section 2 we describe the basic sensors upon 
which PeMS relies, loop detectors. In Section 3 we 
describe our approaches to detecting sensor mal- 
function and in Section 4 describe how we impute 
values that are missing or in error. Section 5 is de- 
voted to a description of how we estimate veloc- 
ity from the loop detectors, and Section 6 describes 
our method of predicting travel times for users. The 
reader will see that these efforts are very much a 
work in progress, with some aspects well developed 
and others under development. 

2. LOOP DETECTORS 

Caltrans TMCs currently operate many types of 
automatic sensors: microwave, infrared, closed cir- 
cuit television and inductive loop. The most com- 
mon type by far, however, is the inductive loop de- 
tector. Inductive loop detectors are wire loops em- 
bedded in each lane of the roadway at regular in- 
tervals on the network, generally every half-mile. 
They operate by detecting the change in inductance 
caused by the metal in vehicles that pass over them. 
A detector reports every 30 seconds the number of 
passing vehicles, and the percentage of time that it 
was covered by a vehicle. The number of vehicles 
is called flow, the percent coverage is called the occupancy.
A roadside controller box operates a set
of loop detectors and transmits the information to
the local Caltrans TMC. This is done through a 
variety of media, from leased phone lines to Cal- 
trans fiber optics. PeMS currently receives data from 
about 22,000 loop detectors in California. 

A single inductance loop does not directly mea- 
sure velocity. However, if the average length of the 
passing vehicles were known, velocity could be in- 
ferred from flow and occupancy. Estimation of veloc- 
ity or, equivalently, average vehicle length has been 
an important part of our work, which is the subject 
of Section 5. At selected locations, two single-loop
detectors are placed in close proximity to form a 
"double-loop" detector, which does provide direct 
measurement of velocity, from the time delay be- 
tween upstream and downstream vehicle signatures. 
Most of the loop detectors in California are single- 
loop detectors while double-loop detectors are more 
widely used in Europe. 

For a particular loop detector, the flow (volume)
and occupancy at sampling time t (corresponding to
a given sampling rate) are defined as

\[
q(t) = \frac{N(t)}{T}, \qquad k(t) = \frac{1}{T} \sum_{j \in J(t)} \tau_j, \tag{1}
\]

where T is the duration of the sampling time interval,
say 5 min, $N(t)$ is the number of cars detected
during the sampling interval t, $\tau_j$ is the on-time of
vehicle j, and $J(t)$ is the set of cars that are detected
in time interval t. The traffic speed at time t
is defined as

\[
v(t) = \frac{1}{N(t)} \sum_{j \in J(t)} v_j,
\]

where $v_j$ is the velocity of vehicle j.
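
These definitions can be made concrete with a small sketch. It assumes flow is the vehicle count divided by the interval duration T, occupancy is the summed on-time divided by T, and speed is the arithmetic mean of the individual vehicle speeds; the function name `interval_summary` and the per-vehicle record layout are hypothetical.

```python
def interval_summary(vehicles, T):
    """Summarize one sampling interval of duration T seconds.

    vehicles: list of (on_time_seconds, speed) pairs, one per detected car.
    Returns (flow in vehicles per second, occupancy as a fraction of T,
    mean speed or None if no car was detected).
    """
    n = len(vehicles)
    flow = n / T
    occupancy = sum(tau for tau, _ in vehicles) / T
    speed = sum(v for _, v in vehicles) / n if n else None
    return flow, occupancy, speed


# Three cars in a 30-second interval, with on-times in seconds and speeds in mph.
flow, occupancy, speed = interval_summary(
    [(0.25, 60.0), (0.5, 30.0), (0.25, 60.0)], 30.0
)
```

A single-loop detector reports only the first two quantities; the per-vehicle speeds used here are available only from double loops or other sources.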

We will use d, t, s, n to denote day, time of day,
detector station and lane, letting them range over
$1,\dots,D$, $1,\dots,T$, $1,\dots,S$ and $1,\dots,N$. By "station"
we mean the collection of loop detectors in
the various lanes at one location. Flow, occupancy and
speed measured from station s, lane l at time t of
day d will be denoted as

\[
q_{s,l}(d,t), \quad k_{s,l}(d,t), \quad v_{s,l}(d,t).
\]

We will also index detectors by $i = 1, \dots, I$ in some
cases and use t to denote sample times, so that notations
like $q_i(t)$, $q_{s,l}(t)$, etc. will be seen as well.

Single-loop detectors are the most abundant source 
of traffic data in California, but loop data are often 
missing or invalid. Missing values occur when there 
is communication error or hardware breakdown. A 
loop detector can fail in various ways even when it
reports values. Payne et al. (1976) identified vari- 
ous types of detector errors including stuck sensors, 
hanging on or hanging off, chattering, cross-talk, 
pulse breakup and intermittent malfunction. Even 
under normal conditions, the measurements from 
loop detectors are noisy; they can be confused by 
multi-axle trucks, for example. 

Bad and missing samples present problems for any 
algorithm that uses the data for analysis, many of 
which require a complete grid of good data. There- 
fore, we need to detect when data are bad and dis- 
card them, and impute bad or missing samples in 
the data with "good" values, preferably in real time. 
The goal of detection and imputation is to produce 
a complete grid of clean data in real time. 

3. DETECTING MALFUNCTION 

Figure 1 illustrates detector failure. The figure 
shows scatter plots of occupancy readings in four 
lanes at a particular location. From these plots it 
can be inferred that loops in the first and second 
lanes suffer from transient malfunction. 

The problem of detecting malfunctions can be 
viewed as a statistical testing problem, wherein the 
actual flow and occupancy are modeled as following 
a joint probability distribution over all loop detec- 
tors and times, and their measured values may be 
missing or produced in a malfunctioning state. Let
$\Delta_i(t) = 0, 1, 2$ according as the state of detector i
at time t is good, malfunctioning, or the data are
missing. The problem of detecting malfunctioning is
that of simultaneously testing $H : \Delta_i(t) = 0$ versus
$K : \Delta_i(t) = 1$ or of estimating the posterior probabilities,
$P(\Delta_i(t) = 1 \mid \text{data})$.

Since the model is too general and high dimen- 
sional for practical use, simplification is necessary. 
The most extreme and convenient simplification is 
to consider only the marginal distribution of individ- 
ual (30-second) samples at an individual detector. In 
that case, the acceptance region and the rejection 
region partition the $(q, k)$ plane.

The early work in malfunction detection used 
heuristic delineations of this partition. Payne et al. 
(1976) presented several ways to detect various types 
of loop malfunctions from 20-second and 5-minute 
volume and occupancy measurements. These meth- 
ods place thresholds on minimum and maximum 
flow, density and speed, and declare data to be invalid
if they fail any of the tests. Along the same
line, Jacobson, Nihan and Bender (1990) at the University
of Washington defined an acceptable region
in the $(q, k)$ plane, and declared samples to be good
only if they fell inside. We will refer to this as the
Washington Algorithm. This has an acceptance region
of the form shown in Figure 2.

Fig. 2. Acceptance region of Washington algorithm.

PeMS currently uses a Daily Statistics Algorithm 
(DSA), proposed by Chen et al. (2003), which pro- 
ceeds as follows. A detector is assumed to be either 
good or bad throughout the entire day. For day d, 
the following scores are calculated: 

• $S_1(i,d)$ = number of samples that have occupancy = 0,

• $S_2(i,d)$ = number of samples that have occupancy > 0 and flow = 0,

• $S_3(i,d)$ = number of samples that have occupancy > $k^*$ (= 0.35),

• $S_4(i,d)$ = entropy of occupancy samples
[$-\sum_{x : p(x) > 0} p(x) \log p(x)$, where $p(x)$ is the histogram
of the occupancy]. If $k_i(d,t)$ is constant
in t, for example, its entropy is zero.

Then the decision $\Delta_i = 1$ is made whenever $S_j >
s_j^*$ for any $j = 1,\dots,4$. The values $s_j^*$ were chosen
empirically. Since this algorithm does not run in real 
time, a detector is flagged as bad on the current day 
if it was bad on the previous day. 
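
The four daily scores can be sketched as follows. This is an illustrative reading, not the PeMS implementation: the histogram bin width, the threshold values and the function names are all hypothetical. The text states the decision rule as a score exceeding its threshold; because a constant occupancy series has zero entropy, this sketch assumes that the entropy score is suspicious when it is low, and that direction is our assumption.

```python
import math
from collections import Counter


def daily_scores(flows, occupancies, k_star=0.35, bin_width=0.01):
    """Compute the four daily scores for one detector and one day.

    flows, occupancies: equal-length lists of 30-second samples.
    bin_width: histogram bin width for the entropy score (an assumption;
    the text does not specify the binning).
    """
    s1 = sum(1 for k in occupancies if k == 0)
    s2 = sum(1 for q, k in zip(flows, occupancies) if k > 0 and q == 0)
    s3 = sum(1 for k in occupancies if k > k_star)

    # Empirical histogram p(x) of the binned occupancy values.
    bins = Counter(round(k / bin_width) for k in occupancies)
    n = len(occupancies)
    s4 = sum(-(c / n) * math.log(c / n) for c in bins.values())
    return s1, s2, s3, s4


def flag_bad(scores, count_limits, entropy_floor):
    """Flag a detector-day as bad.

    count_limits: hypothetical thresholds for the three count scores.
    entropy_floor: hypothetical lower limit for the entropy score
    (low entropy is treated as suspicious; see the assumption above).
    """
    s1, s2, s3, s4 = scores
    too_many = any(s > lim for s, lim in zip((s1, s2, s3), count_limits))
    return too_many or s4 < entropy_floor


# A detector stuck at occupancy 0.1 with no flow, and a varied healthy one.
stuck = daily_scores([0] * 10, [0.1] * 10)
healthy = daily_scores([5, 6, 7, 8], [0.05, 0.06, 0.07, 0.08])
limits = (1200, 50, 200)  # hypothetical values, not the empirical s*
stuck_is_bad = flag_bad(stuck, limits, entropy_floor=1.0)
healthy_is_bad = flag_bad(healthy, limits, entropy_floor=1.0)
```

In this toy case the stuck detector stays under every count limit and is caught only by its zero entropy.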

The idea behind this algorithm is that some loops 
seem to produce reasonable data all the time, while 
others produce suspect data all the time. Although 
it is very hard to tell if a single 30-second sample is 
good or bad unless it is truly abnormal, by looking 
at the time series of measurements for an entire day,
one can usually easily distinguish bad behavior from 
good. 

This procedure effectively corresponds to a model 
in which flow and occupancy measurement failures 
are independent and identically distributed across 
loops. The trajectory of detector i, $\{q_i(t), k_i(t) : t =
1, \dots, T\}$, is a point in the product space $\mathcal{Q} \times \mathcal{K} \times
\mathcal{T}$, where $\mathcal{Q}$, $\mathcal{K}$ and $\mathcal{T}$ are the spaces of q, k and t.
Unlike the Washington algorithm, the partition is 
complicated and impossible to visualize. 

The Daily Statistics Algorithm uses many samples 
(time points) of a single detector. Its main draw- 
backs are (1) that the day-by-day decision is too 
crude, and (2) the spatial correlation of good sam- 
ples is not exploited. Because of (1), a moderate 
number of bad samples at an otherwise good de- 
tector will never be flagged. By (2), we mean that 
some errors that are not visible from a single de- 
tector can be readily recognized if its relationship 
with its spatial and temporal neighbors is considered.
For example, for neighboring detectors i and
j, if the absolute difference $|q_i(t) - q_j(t)|$ is too big, then
either $\Delta_i = 1$ or $\Delta_j = 1$ or both. This has to do with
the high lane-to-lane (and location-to-location) cor- 
relation of both q and k. Figure 1 illustrates these 
points. Loops in the first and second lanes suffer 
from transient malfunctions, which cannot be eas- 
ily detected from one-dimensional marginal distri- 
butions, but which are immediately clear from the 
two-dimensional joint distributions. From their rela- 
tionships with lanes three and four, one can conclude 
that both detectors are bad. 
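
The pairwise comparison suggested above can be sketched as follows. This is a hypothetical check, not a method used by PeMS: the function name `suspicious_pairs`, the threshold `max_diff` and the sample flows are all invented for illustration.

```python
def suspicious_pairs(flows_by_lane, max_diff):
    """List lane pairs whose flows differ by more than max_diff.

    flows_by_lane: dict mapping lane number to flow at one station and
    one sample time. max_diff is a hypothetical tolerance.
    """
    lanes = sorted(flows_by_lane)
    pairs = []
    for a in range(len(lanes)):
        for b in range(a + 1, len(lanes)):
            i, j = lanes[a], lanes[b]
            if abs(flows_by_lane[i] - flows_by_lane[j]) > max_diff:
                pairs.append((i, j))
    return pairs


# Lane 1 reports far fewer vehicles than its three neighbors.
pairs = suspicious_pairs({1: 2, 2: 14, 3: 12, 4: 13}, max_diff=6)
```

A single flagged pair cannot say which of the two detectors is at fault; a lane that appears in every flagged pair, as lane 1 does here, is the natural suspect.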



The Washington algorithm and the DSA are ad 
hoc in conception, and can surely be improved upon. 
A systematic and principled algorithm is hard to 
develop mainly due to the size and complexity of 
the problem. An ideal detection algorithm needs to 
work well with thousands of detectors, all with po- 
tentially unknown types of malfunction. Even con- 
structing a training set is not trivial since there 
is so much data to examine and it is not always 
possible to be absolutely sure if the data are correct
even after careful visual inspection. [For example,
suppose a detector reports $(q, k) = (0, 0)$. It
could be that the detector is stuck in the "off" position,
but good detectors will also report (0, 0) when
there are no vehicles in the detection period. Sim- 
ilarly, occupancy measurements stuck at a reason- 
able value will not trigger any alarm if one consid- 
ers only a single detector and a single time.] New 
approaches should include a method of delineating 
acceptance/rejection regions for k and q for multiple 
sensors, combining traffic dynamics theory and man- 
ual identification of good or bad data points, with 
the help of interactive data analysis tools such as 
XGobi (http://www.research.att.com/areas/stat/xgobi/),
and an intelligent way of combining evidence
from various sensors to make decisions about
a particular sensor/observation. 

4. IMPUTATION 

Holes in the data due to missing or bad observations
must be filled with imputed values. Because of
the high lane-to-lane and location-to-location correlation
of q and k, it is natural to use measurements
from neighboring detectors. Although there is flexibility
in the choice of a neighborhood, in practice we
use the neighborhood defined by the set of loops at
the same location. Let $\mathcal{N}(i)$ denote the set of neighboring
detectors of i and consider imputing flow, for
example.

Fig. 1. Scatter plots of occupancies at station 25 of westbound I-210.

A natural imputation algorithm is the prediction
of $q_i(t)$ based on its neighbors:

\[
\hat{q}_i(t) = g\bigl(q_{\mathcal{N}(i)}(t)\bigr), \tag{2}
\]

where the prediction function g is fit from historical
data $\{(q_i(t), q_{\mathcal{N}(i)}(t)),\ t = 1, \dots, T\}$. (Note that the
prediction function must be able to properly take
into account possible configurations of missing and
bad values among the neighbors; the latter are especially
problematic, since bad readings may not be
flagged as such.)
The simplest idea would be estimation by the mean

\[
\hat{q}_i(d,t) = \frac{\sum_{j \in \mathcal{N}(i)} q_j(d,t)\, 1(\Delta_j(d,t) = 0)}{\sum_{j \in \mathcal{N}(i)} 1(\Delta_j(d,t) = 0)}
\]

or median

\[
\hat{q}_i(d,t) = \operatorname{median}\{q_j(d,t) : j \in \mathcal{N}(i),\ \Delta_j(d,t) = 0\}
\]

to be more robust. However, such simple interpolation
is not desirable since the relationships between
occupancy and flows in neighboring loops are nontrivial,
that is, $q_i(t) \neq q_j(t)$, $j \in \mathcal{N}(i)$, in general. For
example, at many freeway locations, the inner lane
has higher flow and lower occupancy under free-flow
conditions than do the outer lanes. Also, if one
is close to on- or off-ramps, the relationships can be
quite different.

The prediction function is rather hard to manage 
in its full generality because of its high dimension- 
ality and because one does not know which values 
will correspond to correctly functioning detectors 
[$\Delta_j(t) = 0$]. From a computational point of view,
the following algorithm is thus appealing: 

\[
\hat{q}_i(t) = \operatorname{average}\bigl(\hat{q}_{ij}(t) : j \in \mathcal{N}(i),\ \Delta_j = 0\bigr), \tag{3}
\]

where $\hat{q}_{ij}(t) = g_{ij}(q_j(t))$ is the regression of $q_i(t)$ on
$q_j(t)$. One computes $\hat{q}_{ij}(t)$ for all $j \in \mathcal{N}(i)$ and averages
over only those values regressed on "good"
neighbors. The "average" can be either mean or a
robust location estimate such as the median. The
latter seems preferable since some bad samples from
detectors $j \in \mathcal{N}(i)$ may not be flagged.

The individual regression functions $g_{ij}(q_j(t))$ can be fit
in various ways. Chen et al. (2003) considered the
linear regression

\[
q_i(t) = \alpha_0(i,j) + \alpha_1(i,j)\, q_j(t) + \text{noise}
\]

to produce

\[
\hat{q}_{ij}(d,t) = \hat{\alpha}_0(i,j) + \hat{\alpha}_1(i,j)\, q_j(d,t)
\]

for each pair of neighbors (i, j), where the parameters
$\alpha_0(i,j)$, $\alpha_1(i,j)$ are estimated by least squares
using historical data. This is the approach currently
being used by PeMS.
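
The pairwise-regression scheme can be sketched as follows: one least squares line per neighbor, then a median over the predictions from neighbors flagged as good. This is a minimal illustration under our own assumptions, not the PeMS code; the function names, the data layout and the sample values are hypothetical.

```python
from statistics import median


def fit_line(x, y):
    """Ordinary least squares fit of y = a0 + a1 * x.

    Assumes x is not constant (otherwise the slope is undefined).
    """
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    a1 = sxy / sxx
    a0 = my - a1 * mx
    return a0, a1


def impute(current, good, coeffs):
    """Median of the pairwise predictions from neighbors flagged good.

    current: dict neighbor -> current flow q_j(t).
    good: dict neighbor -> True if the neighbor's sample is trusted.
    coeffs: dict neighbor -> (a0, a1) fitted on historical data.
    Returns None when no good neighbor is available.
    """
    preds = [
        coeffs[j][0] + coeffs[j][1] * current[j]
        for j in current
        if good[j]
    ]
    return median(preds) if preds else None


# Hypothetical history for the target lane and two neighboring lanes.
target_hist = [10.0, 20.0, 30.0]
coeffs = {
    2: fit_line([5.0, 10.0, 15.0], target_hist),
    3: fit_line([12.0, 22.0, 32.0], target_hist),
    4: (0.0, 1.0),  # neighbor 4 is flagged bad below, so it is never used
}
estimate = impute(
    current={2: 8.0, 3: 18.0, 4: 500.0},
    good={2: True, 3: True, 4: False},
    coeffs=coeffs,
)
```

Using the median rather than the mean limits the damage when a bad neighbor has slipped through unflagged.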

Since this approach relies upon using historical 
data to learn how pairs of neighboring loops behave, 
estimation of the regression functions must be able 
to cope with bad data as well. Cleaning the historical 
data to detect malfunctions is thus necessary, and 
robust estimation procedures may be preferable to 
least squares. We also note that an empirical Bayes 
perspective may be useful in jointly estimating the 
large set of regression functions. 

5. ESTIMATING VELOCITY 

As we have noted earlier, single-loop detectors do
not directly measure velocity. This is unfortunate, 
because velocity is perhaps the single most useful 
variable for traffic control and traveller information 
systems. In this section we present the method cur- 
rently being used to estimate velocity from single- 
loop data. 

Let us fix a day d and a time of day t and consider
the following situation. Suppose that at a given
detector during a 30-second time interval, N vehicles
pass with (effective) lengths $L_1, \dots, L_N$ and
velocities $v_1, \dots, v_N$. (The effective vehicle length is
equal to the length of the vehicle plus the length of
the loop's detector zone.) The occupancy is given
by $k = \sum_{i=1}^{N} L_i / v_i$. Now, if all velocities are equal,
$v = v_1 = \cdots = v_N$, it follows that

\[
k = \frac{1}{v} \sum_{i=1}^{N} L_i = \frac{N \bar{L}}{v}, \tag{4}
\]

where $\bar{L} = \sum_{i=1}^{N} L_i / N$ is the average of the vehicle
lengths. We see that if the average vehicle length is
known, we can infer the common velocity. We model
the lengths $L_i$ as random variables with common
mean $\mu$. Note that the $L_i$ and $\bar{L}$ are not directly
observed. If $\mu$ were known, while the average $\bar{L}$ is
not, then a sensible estimate of the common velocity
may be obtained by replacing the average by the
mean in (4):

\[
\hat{v} = \frac{N \mu}{k}. \tag{5}
\]

Rewriting, we find $\hat{v} = v\mu/\bar{L}$. Since the expectation
of $1/\bar{L}$ is not equal to $1/\mu$, the expectation of $\hat{v}$ is
not equal to v. In other words, $\hat{v}$ is not an unbiased
estimator of v, despite our assumption that all $v_i$ are
equal. However, if the number of vehicles N is not
too small, then $\bar{L}$ should be reasonably close to its
mean and the bias negligible. Henceforth, we neglect
this bias issue and use formula (5) to estimate velocity.
We thus focus on estimating the mean vehicle
length, $\mu$.

Fig. 3. Velocity (top) and effective vehicle length (bottom) for four weekdays on I-.
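
The relation between the estimate and the true common speed can be checked numerically. The sketch below assumes occupancy is expressed as total on-time in seconds (the sum of length over speed for each vehicle), lengths in feet and speeds in feet per second; the function name `estimate_speed` and all numbers are hypothetical.

```python
def estimate_speed(n_vehicles, mean_length, on_time):
    """Speed estimate: vehicle count times assumed mean length over on-time."""
    return n_vehicles * mean_length / on_time


# Four vehicles, all travelling at 88 ft/s (60 mph); one is a long truck.
lengths = [18.0, 20.0, 22.0, 60.0]  # effective lengths in feet
true_speed = 88.0
on_time = sum(length / true_speed for length in lengths)

avg_length = sum(lengths) / len(lengths)  # 30 ft, unknown in practice
mu = 25.0                                 # assumed mean effective length
v_hat = estimate_speed(len(lengths), mu, on_time)
```

Because the actual average length in this interval (30 ft) exceeds the assumed mean (25 ft), the estimate comes out at about 73.3 ft/s, below the true 88 ft/s, exactly the ratio of the true speed times the mean length to the average length.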

5.1 Estimation of the Mean Vehicle Length 

Currently, it is a widespread practice to take the
mean vehicle length to be constant, independent of
the time of day. The validity of this assumption has
been examined by many authors (e.g., Hall and Persaud,
1989 and Pushkar et al., 1994), including ourselves
(Jia et al., 2001), and it is now generally recognized
that it does not hold in general. This is further
illustrated by double-loop data from Interstate
80 near San Francisco, which allow direct measurement
of velocity. Figure 3 shows the velocity and the
average (effective) vehicle length at detector station
2 in the eastbound outer lane 5. We believe that
the clear daily trend can be ascribed to the ratio of
trucks to cars varying with the time of day. This is
confirmed by the fact that the vehicle length in the
fast lanes 1 and 2, with negligible truck presence,
is almost constant. We thus assume that the mean
vehicle length depends on the time of day, denote it
by $\mu_t$ to reflect this dependence, and consider how
$\mu_t$ can be estimated.

Suppose we have observed $N(d,t)$ and $k(d,t)$ for a
number of days. Let $\alpha_{0.6}$ denote the 60th percentile
of the observed occupancies. Assume that during all
time intervals when $k(d,t) < \alpha_{0.6}$ vehicles travel
at a common velocity $v_{FF}$. Since we may assume
that any freeway is uncongested at least 60% of the
time, $v_{FF}$ may be regarded as the free flow velocity.
Throughout this paper we assume that $v_{FF}$ is known
or estimated from exterior sources of information.

By our assumption on constant free flow velocity,
we have for all $(d,t)$ such that $k(d,t) < \alpha_{0.6}$

\[
\bar{L}(d,t) = \frac{v_{FF}\, k(d,t)}{N(d,t)}.
\]

If we assume that the average vehicle length $\bar{L}(d,t)$
does not depend on whether the occupancy is above
or below the threshold, then

\[
E\bigl(\bar{L}(d,t) \mid k(d,t) < \alpha_{0.6}\bigr) = E\,\bar{L}(d,t) = \mu_t.
\]

For fixed t we can obtain an unbiased estimate of $\mu_t$
as

\[
\frac{1}{\#\{d : k(d,t) < \alpha_{0.6}\}} \sum_{d : k(d,t) < \alpha_{0.6}} \frac{v_{FF}\, k(d,t)}{N(d,t)}.
\]

In Figure 4 we have plotted the time of day $t$ versus
$v_{FF}\,k(d,t)/N(d,t)$ for all times $(d,t)$ when
$k(d,t) < \alpha_{0.6}$. We can now estimate the expectation $\mu_t$
of the effective vehicle length by fitting a regression line to this
scatter plot, via loess (Cleveland, 1979). The smooth regression line
seen in Figure 4 is our estimator $\hat\mu_t$ of $\mu_t$. Note the
absence of points for times between 3 p.m. and 6 p.m. when I-80 East
is always congested [$k(d,t) \ge \alpha_{0.6}$].
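The estimation step just described can be sketched in code. This is a minimal illustration under stated assumptions, not the PeMS implementation: the (days by time slots) array layout and the function name `estimate_mean_length` are hypothetical, volume and occupancy are assumed to be in units for which $v_{FF}k/N$ is a length, and a plain per-slot average stands in for the loess smoother.

```python
import numpy as np

def estimate_mean_length(N, k, v_ff, q=0.6):
    """Per-slot average of v_ff * k / N over free-flow observations.

    N, k : arrays of shape (days, slots) with volume and occupancy.
    v_ff : free flow velocity, assumed known.
    q    : occupancy quantile used as the free-flow threshold.
    Returns an array of length `slots`; NaN marks slots with no
    free-flow observation (e.g. a slot that is congested every day).
    """
    N = np.asarray(N, dtype=float)
    k = np.asarray(k, dtype=float)
    alpha = np.quantile(k, q)            # threshold alpha_q on occupancy
    free = (k < alpha) & (N > 0)         # free-flow intervals with traffic
    safe_N = np.where(N > 0, N, 1.0)     # dummy divisor, masked out below
    L = np.where(free, v_ff * k / safe_N, 0.0)
    count = free.sum(axis=0)             # free-flow days per slot
    total = L.sum(axis=0)
    return np.where(count > 0, total / np.maximum(count, 1), np.nan)
```

Slots returned as NaN correspond to the gap visible in Figure 4; a smoother such as loess fills them by extrapolating from neighboring times of day.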

Once we have an estimator $\hat\mu_t$ of $\mu_t$, we define a
(preliminary) estimator of $v(d,t)$ as

$$\hat v(d,t) = \frac{N(d,t)\,\hat\mu_t}{k(d,t)}. \tag{6}$$

This estimator and the velocity found by the double-loop detector are
plotted in Figure 5. We see that it performs very well during heavy
traffic and congestion. In particular, it exhibits little bias during
the time period 3 p.m. to 6 p.m. over which the smoothing shown in
Figure 4 was extrapolated. Unfortunately, the variance of the
estimator during times of light traffic, particularly in the early
hours of each day, is unacceptably large. This is clearly visible in
Figure 5 with estimated velocities on day 3 around 1 a.m. shooting up
to 120 mph shortly before plummeting to 30 mph. The true velocity at
that time is nearly constant at 64 mph. Recall that our preliminary
estimate (6) is obtained by replacing the average (effective) vehicle
length $L(d,t)$ by (an estimate of) its expectation $\mu_t$. When only
a few vehicles pass the detector during a given time interval, the
average vehicle length will have a large variance. Hence, in light
traffic, the average vehicle length is likely to differ substantially
from the mean. For instance, if only 10 vehicles pass, then it makes a
big difference if there are 6 cars and 4 trucks or 7 cars and 3
trucks. This explains the large fluctuations of our preliminary
estimator $\hat v$ during light traffic.
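To make the size of this effect concrete, consider a purely illustrative calculation in which every car has effective length 15 ft and every truck 60 ft (hypothetical round numbers, not taken from the data). The two vehicle mixes then give average lengths

$$\bar L_{6/4} = \frac{6\cdot 15 + 4\cdot 60}{10} = 33\ \text{ft}, \qquad \bar L_{7/3} = \frac{7\cdot 15 + 3\cdot 60}{10} = 28.5\ \text{ft}.$$

Because the preliminary estimator uses a fixed length in place of the realized average, the ratio of estimated to true velocity differs between these two mixes by a factor of about $33/28.5 \approx 1.16$ under the assumed lengths: one truck more or less among 10 vehicles moves the estimate by roughly 16%.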

5.2 Smoothing 

Coifman (2001) suggests a simple fix for the unstable behavior of
$\hat v$ during light traffic. He sets the estimated velocity equal to
the free flow velocity $v_{FF}$ when the occupancy is low:

$$\hat v_{\mathrm{Coifman}}(d,t) = \begin{cases} \hat v(d,t), & \text{if } k(d,t) \ge \alpha_{0.6},\\ v_{FF}, & \text{otherwise.} \end{cases}$$

The performance of this estimator, in terms of mean squared error, is
certainly not bad. However, about 60% of the time (roughly 14 out of
every 24 hours), the estimated velocity is a constant, which is not
realistic. We can do better, in appearance as well as in mean squared
error.

It is clear that we need to smooth our preliminary estimate
$\hat v(d,t)$, but only when the volume is small. For the purpose of
real-time traffic management, it is important that our smoother be
causal and easy to compute with minimal data storage. Taking all this
into consideration, we used an exponential filter with varying
weights. A smoothed version $\tilde v$ of $\hat v$ is defined
recursively as

$$\tilde v(d,t) = w(d,t)\,\hat v(d,t) + \bigl(1 - w(d,t)\bigr)\,\tilde v(d,t-1), \tag{7}$$

where

$$w(d,t) = \frac{N(d,t)}{N(d,t) + C}, \tag{8}$$


and $C$ is a smoothing parameter to be specified. If the time interval
is of length 5 minutes, then a reasonable value would be $C = 50$.
With this value of $C$, if the volume $N(d,t)$ approaches capacity,
say $N(d,t) = 100$ vehicles per 5 minutes, then there is hardly any
need for smoothing and the new observation receives substantial weight
2/3. On the other hand, if the volume is very small, say
$N(d,t) = 10$, then the smoothing is quite severe with the new
observation receiving a weight of only 1/6.
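The recursion (7) with weights (8) is short enough to sketch directly. The function name `smooth_velocity` and the choice to start the recursion at the first preliminary estimate are assumptions made for illustration; the text does not specify an initialization.

```python
def smooth_velocity(v_hat, N, C=50.0, v_init=None):
    """Exponential filter with volume-dependent weights, as in (7)-(8).

    v_hat  : preliminary velocity estimates for consecutive intervals.
    N      : volumes for the same intervals.
    C      : smoothing parameter (50 is suggested for 5-minute data).
    v_init : starting value of the recursion; defaults to v_hat[0]
             (an assumption, not specified in the text).
    """
    if len(v_hat) != len(N):
        raise ValueError("v_hat and N must have the same length")
    if len(v_hat) == 0:
        return []
    prev = v_hat[0] if v_init is None else v_init
    smoothed = []
    for v, n in zip(v_hat, N):
        w = n / (n + C)                  # weight on the new observation
        prev = w * v + (1.0 - w) * prev  # causal update, one stored value
        smoothed.append(prev)
    return smoothed
```

With $C = 50$, a volume of 100 gives weight $100/150 = 2/3$ and a volume of 10 gives $10/60 = 1/6$, matching the values quoted above. Only the previous smoothed value is stored, which is what makes the filter cheap in real time.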

Our filtered estimator $\tilde v$ is plotted in Figure 6. The
correspondence with the true velocity is very good. The large
variability during light traffic that plagued the preliminary
estimator $\hat v$ has been suppressed, while its good performance
during heavy traffic and congestion has been retained.

We will now explain how our filter is "inspired" by the familiar
Kalman filter. Suppose that the true, unobserved velocity evolves as a
simple random walk:

$$v_t = v_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim N(0, \tau^2). \tag{9}$$

Suppose we observe $\hat v_t = N_t\hat\mu_t/k_t = v_t\hat\mu_t/L_t$,
where $\hat\mu_t$ is our estimate of $E L_t = \mu_t$. We will work
conditionally on the observed volume $N_t$. The conditional
expectation of $\hat v_t$ is, though not quite equal, hopefully close
to $v_t$. Using a one-step Taylor approximation, we find that the
conditional variance of $\hat v_t$ is of the order $1/N_t$. This
"inspires" a measurement equation

$$\hat v_t = v_t + \xi_t, \qquad \xi_t \sim N(0, \sigma^2/N_t). \tag{10}$$

Finally, we assume that all error terms $\varepsilon_t$ and $\xi_t$
are independent. Note that the variance of the measurement error
depends inversely on the observed volume $N_t$. In light traffic, when
$N_t$ is small the variance is large. This is exactly the problem we
noted in Figure 5.

Fig. 4. Estimation of the mean effective vehicle length $\mu_t$.

Fig. 5. Our preliminary estimate, defined in (6), superimposed on the true velocity.

The Kalman filter recursively computes the conditional expectation of
the unobserved state variable $v_t$ given the present and past
observations $\hat v_1, \hat v_2, \ldots, \hat v_t$. In our simple
model we can easily derive the Kalman recursions. They are

$$\tilde v_t = w_t\,\hat v_t + (1 - w_t)\,\tilde v_{t-1}$$

with

$$w_t = \frac{N_t}{N_t + \sigma^2/P_t},$$

where $P_t$ is the prediction error $E(v_t - \tilde v_{t-1})^2$.

We note the similarity of these Kalman recursions with our filter (7),
although $C$ in (8) is constant and the analogue in the Kalman filter,
$\sigma^2/P_t$, is not. We decided not to try to estimate $\sigma^2$
and $\tau^2$, partly because we feel that would be difficult to do
reliably and partly because that would mean taking our simple model a
little too seriously.
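For comparison, a textbook scalar Kalman filter for a random-walk state with measurement variance inversely proportional to the volume can be sketched as follows. The function name and the parameters `sigma2`, `tau2`, `v0` and `P0` are illustrative assumptions; the sketch only shows how the quantity playing the role of $C$ changes from step to step, and it is not the filter adopted above.

```python
def kalman_velocity(v_hat, N, sigma2, tau2, v0, P0):
    """Scalar Kalman filter for a random-walk velocity.

    State:       v_t = v_{t-1} + eps_t,   Var(eps_t) = tau2
    Observation: v_hat_t = v_t + xi_t,    Var(xi_t)  = sigma2 / N_t
    v0, P0 : mean and variance of the state before the first observation.
    Returns (filtered means, weights w_t).
    """
    means, weights = [], []
    v, P = v0, P0
    for obs, n in zip(v_hat, N):
        P_pred = P + tau2                   # prediction variance grows
        w = n / (n + sigma2 / P_pred)       # gain; sigma2 / P_pred acts like C
        v = w * obs + (1.0 - w) * v         # same form as the recursion (7)
        P = (1.0 - w) * P_pred              # variance after the update
        means.append(v)
        weights.append(w)
    return means, weights
```

The update has the same form as (7), but the term added to $N_t$ in the denominator of the weight depends on the running prediction variance and therefore on the past volumes, whereas $C$ is fixed.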

5.3 Known Free Flow Velocity 

We assume that the free flow velocity $v_{FF}$ is known, which is
typically not true. We believe that free flow velocity depends
primarily on the number of lanes and on the lane number, so in
practice we use values like those shown in Table 1, which are loosely
based on experience and empirical evidence from locations with
double-loop detectors.

Table 1
Measured average free flow speeds (mph) for each lane (rows) of a
multilane freeway depending on the total number of lanes (columns)

                      Number of lanes
Lane number       2       3       4       5
1               71.3    71.9    74.8    76.5
2               65.8    69.7    71.0    74.0
3                 -     62.7    67.4    72.0
4                 -       -     62.8    69.2
5                 -       -       -     64.5

Clearly, it would be preferable to have an inde- 
pendent method to estimate site-specific free flow 
velocity. Petty et al.'s (1998) cross-correlation ap- 
proach works well when occupancy and volume are 
measured in 1-second intervals. However, 20- or 30- 
second measurement intervals are more common and 
at such aggregation this method breaks down. 

5.4 Further Assumptions on Mean Vehicle Length

We have assumed that the mean (expected) vehicle length $\mu_t$
depends on the time of day only. However, we have noticed that $\mu_t$
also depends on:

1. Day of the week. The vehicle mix on a Monday differs from that on
   a Sunday.

2. Lane. There is a higher fraction of trucks in the outer lanes.

3. Location of the detector station. Certain routes are more heavily
   traveled by trucks than others.

4. Detector sensitivity. Loop detectors are fairly crude instruments
   that are almost impossible to calibrate accurately. If a detector
   is not properly calibrated, the occupancy measurements will be
   biased.

To account for all this, we must form separate estimates of $\mu_t$ to
cover these different situations. We store estimates of $\mu_t$ for
every 5-minute interval, for every day of the week and for every lane
at every detector station. In real time, the appropriate values are
retrieved, multiplied by the observed volume-to-occupancy ratio and
filtered.

Fig. 6. Our estimate $\tilde v$, defined in (7), superimposed on the true velocity.

5.5 Other Methods 

We briefly review two other methods that also do not assume a fixed
value for $L(d,t)$, beginning with a method described in Jia et al.
(2001). Suppose that we have a state variable $X(d,t)$ which is 0
during congestion and 1 during free flow. The state variable may be
defined, for instance, by thresholding the occupancy $k(d,t)$. While
the state is "free flow," the algorithm tracks $L(d,t)$, assuming
constant free flow velocity. As soon as the state becomes "congested,"
$L(d,t)$ is kept fixed and the velocity $v(d,t)$ is tracked.
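The alternation between the two states can be sketched as below. This is a simplified reading of the description above, not the algorithm of Jia et al. (2001) itself: the occupancy threshold rule, the direct (unsmoothed) update of the length, and the names `two_state_tracker`, `k_threshold` and `L_init` are all assumptions.

```python
def two_state_tracker(N, k, v_ff, k_threshold, L_init):
    """Track vehicle length in free flow and velocity in congestion.

    N, k        : volumes and occupancies for consecutive intervals.
    v_ff        : free flow velocity, assumed known.
    k_threshold : occupancy below which the state is "free flow" (X = 1).
    L_init      : vehicle length used until the first free-flow update.
    Returns (lengths, velocities), one entry per interval.
    """
    L = L_init
    lengths, velocities = [], []
    for n, occ in zip(N, k):
        if occ < k_threshold:            # state X = 1, free flow
            if n > 0:
                L = v_ff * occ / n       # update length at fixed velocity
            v = v_ff
        else:                            # state X = 0, congested
            v = n * L / occ              # length frozen, velocity tracked
        lengths.append(L)
        velocities.append(v)
    return lengths, velocities
```

The sketch makes the sensitivity to the state variable visible: every congested interval misclassified as free flow overwrites the stored length with a value computed from the wrong velocity.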

The main problem we experienced with this algorithm is that it depends
crucially on $X(d,t)$. In particular, if $X(d,t) = 1$ (free flow)
while congestion has already set in, the method goes badly astray. We
found it difficult to develop a good rule to define $X(d,t)$. In fact,
this difficulty was the main reason for us to look for a different
approach.

Building on work of Dailey (1999), Wang and Nihan (2000) propose a
model-based approach to estimate $L(d,t)$ and $v(d,t)$. Their
log-linear model relates $L(d,t)$ to the expectation and variance of
the occupancy $k(d,t)$, to the volume $N(d,t)$ and to two indicator
functions that distinguish between high flow and low flow situations.
The model has five parameters which need to be estimated from
double-loop data. It is not at all clear if these parameter estimates
carry over to a particular, single-loop location of interest. Wang and
Nihan (2000) defer this issue to future research.



6. PREDICTION 

We now turn our attention to travel time predic- 
tion between any two points of a freeway network 
for any future departure time. Regular drivers, such 
as commuters, choose their routes based on histori- 
cal experience, but factors including daily variation 
in demand, environmental conditions and incidents 
can change traffic conditions. Since heavy congestion 
occurs at the time that most drivers need travel time 
information, free flow travel times, such as those pro- 
vided by MapQuest, are of little use. The result may 
be inefficient use of the network. Route guidance sys- 
tems based on current travel time predictions such 
as variable message boards could thus improve net- 
work efficiency. 

We are currently developing an Internet applica- 
tion which will give the commuters of Caltrans Dis- 
trict 7 (Los Angeles) the opportunity to query the 
prediction algorithm we describe below. The user 
will access our Internet site and state origin, destination and time
of departure (or desired time of arrival), either using text input or
interactively querying a map of the freeway system by pointing and
clicking. He or she will then receive a prediction of 
the travel time and the best (fastest) route to take. 
It would also be possible to make our service avail- 
able for users of cellular telephones, and in fact we 
plan to do so in the near future. 

6.1 Methods of Prediction 

The task is to forecast the time of a trip from 
loop a to loop b departing at some time in the future, 
using the information recorded up to the current 
time from all intervening loop detectors. One possible approach would
be to model the physical process of traffic flow, using, for example,
a simulation program such as those mentioned in the Introduction.
However, such simulations would have to be run in real time and be
calibrated precisely. In general, it is not clear that the best way to
predict a functional of the complex process of traffic flow is via
modeling the entire process. For this reason, various purely
statistical approaches, including multivariate state-space methods
(Stathopoulos and Karlaftis, 2003), space-time autoregressive
integrated moving average models (Kamarianakis and Prastacos, 2005),
and neural networks (Dougherty and Cobbett, 1997; Van Lint and
Hoogendoorn, 2002), have been proposed.

It is not obvious how to use the information from all the intervening
loops, but we have found a method based on a simple compression
(feature) of this data to be remarkably effective (Rice and van Zwet,
2004; Zhang and Rice, 2003). From $v$ evaluated at an array of times
and loops, we can compute travel times $T_d(t)$ that should
approximate the time it took to travel from loop $a$ to loop $b$
starting at time $t$ on day $d$, by "walking" through the velocity
field. We can also compute a proxy for these travel times which is
defined by

$$T_d^*(t) = \sum_{i=a}^{b-1} \frac{2u_i}{v_i(d,t) + v_{i+1}(d,t)}, \tag{11}$$

where $u_i$ denotes the distance from loop $i$ to loop $i+1$. We call
$T_d^*$ the current status travel time (a.k.a. the snap-shot or frozen
field travel time). It is the travel time that would have resulted
from departure from loop $a$ at time $t$ on day $d$ were there no
changes in the velocity field until loop $b$ was reached. It is
important to notice that the computation of $T_d^*(t)$ only requires
information available at time $t$, whereas computation of $T_d(t)$
requires information at later times.
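The difference between the two quantities can be illustrated with a short sketch. The nested-list layout `v[i][t]`, the slot length `dt`, and the rule of reading the velocity at the slot in which a segment is entered are assumptions chosen for simplicity; a finer discretization of the "walk" is possible.

```python
def current_status_travel_time(v, u, t):
    """Proxy (11): segment lengths over mean endpoint speeds at slot t.

    v : v[i][t] is the velocity at loop i (ordered from a to b), slot t.
    u : u[i] is the distance from loop i to loop i + 1.
    """
    return sum(2.0 * u[i] / (v[i][t] + v[i + 1][t]) for i in range(len(u)))


def walked_travel_time(v, u, t_start, dt):
    """Travel time from stepping through the velocity field.

    Each segment is crossed at the mean endpoint velocity read at the
    slot in which the vehicle enters it. dt is the slot length in the
    time unit of u / v. Slots beyond the data are clamped to the last.
    """
    n_slots = len(v[0])
    start = t_start * dt
    clock = start
    for i in range(len(u)):
        slot = min(int(clock // dt), n_slots - 1)
        clock += 2.0 * u[i] / (v[i][slot] + v[i + 1][slot])
    return clock - start
```

When the velocity field does not change over the course of the trip the two functions agree; when speeds drop after departure, the walked time exceeds the current status time, which uses only slot `t`.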

Suppose we have observed $v_i(d,t)$ for a number of days $d \in D$ in
the past, that a new day $e$ has begun, and we have observed
$v_i(e,t)$ at times $t \le \tau$. We call $\tau$ the "current time."
Our aim is to predict $T_e(\tau+\delta)$, the time a trip that departs
from $a$ at time $\tau+\delta$ will take to reach $b$. Note that even
for $\delta = 0$ this is not trivial.

Define the historical mean travel time as

$$\mu(t) = \frac{1}{|D|}\sum_{d \in D} T_d(t). \tag{12}$$

Two naive predictors of $T_e(\tau+\delta)$ are $T_e^*(\tau)$ and
$\mu(\tau+\delta)$. We expect, and indeed this is confirmed by
experiment, that $T_e^*(\tau)$ predicts well for small $\delta$ and
$\mu(\tau+\delta)$ predicts better for large $\delta$. We aim to
improve on both these predictors for all $\delta$.

6.1.1 Linear regression. From the extensive PeMS data, we have
observed an empirical fact: there exist linear relationships between
$T^*(t)$ and $T(t+\delta)$ for all $t$ and $\delta$. This empirical
finding has held up in all of the numerous freeway segments in
California that we have examined. It is illustrated by Figures 7 and
8, which are scatter plots of $T^*(t)$ versus $T(t+\delta)$ for a
48-mile stretch of I-10 East in Los Angeles. Note that the relation
varies with the choice of $t$ and $\delta$. We thus propose the
following model:

$$T(t+\delta) = \alpha(t,\delta) + \beta(t,\delta)\,T^*(t) + \varepsilon, \tag{13}$$

where $\varepsilon$ is a zero mean random variable modeling random
fluctuations and measurement errors. Note that the parameters $\alpha$
and $\beta$ are allowed to vary with $t$ and $\delta$. Linear models
with varying parameters are discussed in Hastie and Tibshirani (1993).

Fitting the model to our data is a familiar linear regression problem
which we solve by weighted least squares. Define the pair
$(\hat\alpha(t,\delta), \hat\beta(t,\delta))$ to minimize

$$\sum_{d \in D}\sum_{s}\bigl(T_d(s) - \alpha(t,\delta) - \beta(t,\delta)\,T_d^*(t)\bigr)^2 K(t+\delta-s), \tag{14}$$

where $K$ denotes the Gaussian density with mean zero and a variance
which is a bandwidth parameter. The purpose of this weight function is
to impose smoothness on $\alpha$ and $\beta$ as functions of $t$ and
$\delta$. We assume that $\alpha$ and $\beta$ are smooth in $t$ and
$\delta$ because we expect that average properties of the traffic do
not change abruptly. The actual prediction of $T_e(\tau+\delta)$
becomes

$$\hat T_e(\tau+\delta) = \hat\alpha(\tau,\delta) + \hat\beta(\tau,\delta)\,T_e^*(\tau). \tag{15}$$

Writing $\alpha(t,\delta) = \alpha'(t,\delta)\,\mu(t+\delta)$, where
$\mu$ is the historical mean (12), we see that (13) expresses a future
travel time as a linear combination of the historical mean and the
current status travel time, our two naive predictors. Hence our new
predictor may be interpreted as the best linear combination of our
naive predictors. From this point of view, we can expect our predictor
to do better than both, and it does, as is demonstrated below.
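The weighted least squares fit in (14) can be sketched with NumPy as follows. The (days by slots) array layout, the function name `fit_alpha_beta`, and measuring the bandwidth in slots are assumptions; the kernel's normalizing constant is dropped because it does not affect the minimizer.

```python
import numpy as np

def fit_alpha_beta(T, T_star, t, delta, bandwidth):
    """Kernel-weighted least squares fit of (13) at one (t, delta).

    T, T_star : arrays of shape (days, slots) holding the actual and
                current status travel times on historical days.
    t, delta  : slot indices of the current time and the lag.
    bandwidth : standard deviation of the Gaussian kernel, in slots.
    Returns (alpha_hat, beta_hat).
    """
    T = np.asarray(T, dtype=float)
    T_star = np.asarray(T_star, dtype=float)
    days, slots = T.shape
    s = np.arange(slots)
    k_w = np.exp(-0.5 * ((t + delta - s) / bandwidth) ** 2)  # K(t+delta-s)
    y = T.ravel()                              # responses T_d(s), day-major
    x = np.repeat(T_star[:, t], slots)         # regressor T*_d(t) per row
    w = np.tile(k_w, days)                     # kernel weight per row
    X = np.column_stack([np.ones_like(x), x])
    sw = np.sqrt(w)                            # WLS via rescaled OLS
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return coef[0], coef[1]
```

Given the fitted pair, the prediction (15) is just `alpha_hat + beta_hat * T_star_today[tau]`, so the fits can be tabulated off-line for all $(t,\delta)$ and looked up in real time.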




Fig. 7. $T^*$(9 a.m.) vs. $T$(9 a.m. + 0 min). Also shown is the
regression line with slope $\hat\beta$(9 a.m., 0 min) = 0.65 and
intercept $\hat\alpha$(9 a.m., 0 min) = 17.3.

Fig. 8. $T^*$(3 p.m.) vs. $T$(3 p.m. + 60 min). Also shown is the
regression line with slope $\hat\beta$(3 p.m., 60 min) = 1.1 and
intercept $\hat\alpha$(3 p.m., 60 min) = 9.5.

Another way to think about (13) is by remember- 
ing that the word "regression" arose from the phrase 
"regression to the mean." In our context, we would 
expect that if T* is much larger than average, signi- 
fying severe congestion, then congestion will proba- 
bly ease during the course of the trip. On the other 
hand, if T* is much smaller than average, congestion 
is unusually light and the situation will probably 
worsen during the journey. 

In addition to comparing our predictor to the his- 
torical mean and the current status travel time, we 
subject it to a more competitive test. We consider 
two other predictors that may be expected to do 
well, one resulting from principal component anal- 
ysis and one from the nearest-neighbors principle. 
Next, we describe these two methods. 

6.1.2 Principal components. Our predictor $\hat T$ only uses
information at one time point: the "current time" $\tau$. However, we
do have information prior to that time. The following method attempts
to exploit this by using the entire trajectories of $T_e$ and $T_e^*$
which are known up to time $\tau$.

Formally, let us assume that the travel times on different days are
independently and identically distributed and that for a given day
$d$, $\{T_d(t) : t \in \mathcal{T}\}$ and
$\{T_d^*(t) : t \in \mathcal{T}\}$ are jointly multivariate normal. We
estimate the large covariance matrix of this multivariate normal
distribution by retaining only a few of the largest eigenvalues in the
singular value decomposition of the empirical covariance of
$\{(T_d(t), T_d^*(t)) : d \in D, t \in \mathcal{T}\}$. Define $t'$ to
be the largest $t$ such that $t + T_e(t) \le \tau$. That is, $t'$ is
the (random) start time of the latest trip that we would have seen
completed if we observed day $e$ until time $\tau$. With the estimated
covariance we can now compute the conditional expectation of
$T_e(\tau+\delta)$ given $\{T_e(t) : t \le t'\}$ and
$\{T_e^*(t) : t \le \tau\}$. This is a standard computation which is
described, for instance, in Mardia et al. (1979). The resulting
conditional expectation is our principal components predictor.

6.1.3 Nearest neighbors. As an alternative, we now consider another
attempt to use information prior to the current time $\tau$, based on
nearest neighbors. This nonparametric method makes fewer assumptions
(such as joint normality) on the relation between $T^*$ and $T$ than
does the principal components method, but is tied to a particular
metric.

The nearest-neighbor method uses that day in the past which is most
similar to the present day in some appropriate sense. The remainder of
that past day beyond time $\tau$ is then taken as a predictor of the
remainder of the present day.

The method requires a suitable distance $m$ between days. We have
investigated two possible distances:

$$m_1(e,d) = \sum_{i,\,t \le \tau} \bigl|v_i(e,t) - v_i(d,t)\bigr| \tag{16}$$

and

$$m_2(e,d) = \Bigl(\sum_{t \le \tau} \bigl(T_e^*(t) - T_d^*(t)\bigr)^2\Bigr)^{1/2}. \tag{17}$$

Now, if day $d'$ minimizes the distance to $e$ among all $d \in D$,
our prediction is

$$\hat T_e^{NN}(\tau+\delta) = T_{d'}(\tau+\delta). \tag{18}$$

Sensible modifications of the method are windowed nearest neighbors
and $k$-nearest neighbors. Windowed-NN recognizes that not all
information prior to $\tau$ is equally relevant. Choosing a window
size $w$, it takes the above summation to range over all $t$ between
$\tau - w$ and $\tau$. The $k$-nearest neighbor modification finds the
$k$ closest days in $D$ and bases a prediction on a (possibly
weighted) combination of these. However, neither of these variants
appears to significantly improve on the vanilla nearest-neighbor
predictor (18).
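A minimal sketch of the nearest-neighbor predictor with distance (17), including the optional window, is given below. The array layout and the function name `nearest_neighbor_predict` are assumptions, and ties between equally distant days are broken by taking the first.

```python
import numpy as np

def nearest_neighbor_predict(T_hist, Tstar_hist, Tstar_today, tau, delta,
                             window=None):
    """Predict T_e(tau + delta) from the most similar past day.

    T_hist, Tstar_hist : arrays (days, slots) for the past days in D.
    Tstar_today        : array (slots,); entries after tau are ignored.
    tau, delta         : slot indices of the current time and the lag.
    window             : if given, compare only slots tau-window..tau.
    Returns (prediction, index of the nearest day).
    """
    T_hist = np.asarray(T_hist, dtype=float)
    Tstar_hist = np.asarray(Tstar_hist, dtype=float)
    Tstar_today = np.asarray(Tstar_today, dtype=float)
    start = 0 if window is None else max(0, tau - window)
    diff = Tstar_hist[:, start:tau + 1] - Tstar_today[start:tau + 1]
    dist = np.sqrt((diff ** 2).sum(axis=1))     # m2(e, d) for every d
    d_best = int(np.argmin(dist))               # first minimizer on ties
    return T_hist[d_best, tau + delta], d_best
```

A $k$-nearest-neighbor variant would replace `argmin` by the indices of the `k` smallest distances and average the corresponding rows of `T_hist`.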

6.2 Results 

To compare these methods we used flow and occupancy data from 116
single-loop detectors along 48 miles of I-10 East in Los Angeles
(between postmiles 1.28 and 48.525). Measurements were done at
5-minute aggregation at times $t$ ranging from 5 a.m. to 9 p.m. for 34
weekdays between June 16 and September 8, 2000. We used the methods we
have previously described to convert flow and occupancy to velocity.

The quality of our I-10 data is quite good and we have used simple
interpolation to impute wrong or missing values. The resulting
velocity field $v_i(d,t)$ is shown in Figure 9 where day $d$ is June
16. The horizontal streaks typically indicate detector malfunction.

From the velocities we computed travel times for trips starting
between 5 a.m. and 8 p.m. Figure 10 shows these $T_d(t)$ where time of
day $t$ is on the horizontal axis. Note the distinctive morning and
afternoon congestions and the huge variability of travel times,
especially during those periods. During afternoon rush hour we find
travel times ranging from 45 minutes up to two hours. Included in the
data are holidays July 3 and 4 which may readily be recognized by
their very short travel times.

Fig. 9. Velocity field $V(d,l,t)$ where day $d$ = June 16, 2000.
Darker shades indicate lower speeds.

We have estimated the root mean squared (RMS) error of our various
prediction methods for a number of "current times" $\tau$ ($\tau$ =
6 a.m., 7 a.m., ..., 7 p.m.) and lags $\delta$ ($\delta$ = 0 and 60
minutes). The RMS errors were estimated by leaving out one day at a
time, performing the prediction for that day on the basis of the
remaining other days, and averaging the squared prediction errors.
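The leave-one-day-out procedure is generic and can be sketched independently of the predictor. The callables `fit` and `predict` are hypothetical placeholders for any of the methods above, and the per-day records may have any structure those callables understand.

```python
import math

def leave_one_out_rmse(days, fit, predict):
    """Leave-one-day-out estimate of the RMS prediction error.

    days    : list of per-day records.
    fit     : function(training_days) -> fitted model.
    predict : function(model, held_out_day) -> (prediction, actual).
    """
    if len(days) < 2:
        raise ValueError("need at least two days")
    sq_errors = []
    for i, held_out in enumerate(days):
        training = days[:i] + days[i + 1:]      # all days except day i
        model = fit(training)
        predicted, actual = predict(model, held_out)
        sq_errors.append((predicted - actual) ** 2)
    return math.sqrt(sum(sq_errors) / len(sq_errors))
```

Because each held-out day requires a refit, the cost is one fit per day per $(\tau,\delta)$ pair, which is manageable for 34 days.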

The prediction methods all have smoothing pa- 
rameters that must be specified. For the regression 
method we chose the standard deviation of the Gaus- 
sian kernel K to be 10 minutes. For the principal 
components method we chose the number of eigen- 
values retained to be four. For the nearest-neighbors 
method we have chosen distance function (17), a 
window w of 20 minutes and the number k of near- 
est neighbors to be two. The results were fairly in- 
sensitive to these precise choices. 

Figures 11 and 13 show the estimated RMS prediction errors of the
historical mean $\mu(\tau+\delta)$, the current status predictor
$T_e^*(\tau)$ and our regression predictor (15) for lag $\delta$ equal
to 0 and 60 minutes, respectively. Note how $T_e^*(\tau)$ performs
well for small $\delta$ ($\delta = 0$) and how the historical mean
does not become worse as $\delta$ increases. Most importantly,
however, notice how the regression predictor dominates both.

Figures 12 and 14 again show the RMS prediction error of the
regression estimator. This time, it is compared to the principal
components predictor and the nearest-neighbors predictor (18). Again,
the regression predictor comes out on top, although the
nearest-neighbors predictor shows comparable performance.

Fig. 12. Estimated RMSE, lag = 0 minutes. Principal components
(- · -), nearest neighbors (- - -) and linear regression (—).

Fig. 14. Estimated RMSE, lag = 60 minutes. Principal components
(- · -), nearest neighbors (- - -) and linear regression (—).

The RMS error of the regression predictor stays below 10 minutes even
when predicting an hour ahead. We feel that this is impressive for a
trip of 48 miles through the heart of Los Angeles during rush hour.

Comparison of the regression predictor to the principal components and
nearest-neighbors predictors is surprising: the results indicate that
given $T^*(\tau)$, there is not much information left in the earlier
$T^*(t)$ ($t < \tau$) that is useful for predicting $T(\tau+\delta)$,
at least by the methods we have considered. In fact, we have come to
believe that for the purpose of predicting travel times, all the
information in the $v_i(d,t)$ up to time $\tau$ is well summarized by
one single number: $T^*(\tau)$.

Recently, Nikovski et al. (2005) compared the per- 
formance of several statistical methods on data from 
a 15-km stretch of freeway in Japan. Their conclu- 
sions mirrored ours: a regression approach outper- 
formed neural networks, regression trees and nearest- 
neighbor methods. They also reached the conclusion 
that the predictive information is contained in the 
current travel time. 

6.3 Further Remarks 

It is of practical importance to note that our prediction can be
performed in real time. Computation of the parameters $\hat\alpha$ and
$\hat\beta$ is time consuming but it can be done off-line in
reasonable time. The actual prediction is then trivial to compute.

Fig. 13. Estimated RMSE, lag = 60 minutes. Historical mean (- · -),
current status (- - -) and linear regression (—).

We conclude this section by briefly pointing out two extensions of our
prediction method:

1. For trips from $a$ to $c$ via $b$ we have

   $$T_d(a,c,t) = T_d(a,b,t) + T_d\bigl(b,c,t + T_d(a,b,t)\bigr). \tag{19}$$

   We have found that it is sometimes more practical or advantageous
   to predict the terms on the right-hand side than to predict
   $T_d(a,c,t)$ directly. For instance, when predicting travel times
   across networks (graphs), we need only predict travel times for the
   edges and then use (19) to piece these together to obtain
   predictions for arbitrary routes.

2. In the discussion above we regressed the travel time
   $T_d(t+\delta)$ on the current status $T_d^*(t)$, where
   $T_d(t+\delta)$ is the travel time departing at time $t+\delta$.
   Now, define $S_d(t)$ to be the travel time arriving at time $t$ on
   day $d$. Regressing $S_d(t+\delta)$ on $T_d^*(t)$ allows us to make
   predictions on the travel time subject to arrival at time
   $t+\delta$. The user can thus ask what time he or she should depart
   in order to reach an intended destination at a desired time.
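The first extension amounts to chaining per-edge predictions, with each edge entered at the time the previous one is left. A minimal sketch follows; `edge_time` is a hypothetical per-edge predictor supplied by the caller, and the edge identifiers can be any objects it accepts.

```python
def route_travel_time(edges, edge_time, t_depart):
    """Chain edge travel times along a route, as in (19).

    edges     : ordered list of edge identifiers along the route.
    edge_time : function(edge, departure_time) -> travel time on that
                edge when entered at departure_time (hypothetical).
    t_depart  : departure time at the start of the first edge.
    """
    clock = t_depart
    for edge in edges:
        clock += edge_time(edge, clock)   # enter next edge on arrival
    return clock - t_depart
```

Applying (19) repeatedly in this way means only edge-level predictors need to be fitted, while routes of any length can be evaluated on demand.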

7. CONCLUSION 

Modern communication and computational facil- 
ities make possible, in principle, systematic use of 
the vast quantities of historical and real-time data 
collected by traffic management centers. Such ef- 
forts invariably require substantial use of statistical 
methodology, often of a nonstandard variety, sensi- 
tive to computational efficiency. 

This paper has concentrated on data collected by inductance loops in
freeways, but similar data are often available on arterial streets as
well, which have more complex flows and geometry. There is also
information from other types of sensors. For example, declining costs
make video monitoring an attractive technology, bringing with it
challenging problems in computer vision and statistics. As another
example, data derived from transponders installed in individual
vehicles for automatic toll payments are a potentially rich source of
information about traffic flow, since the tags can in principle be
sensed at locations other than toll booths. Effective extraction of
information will require active collaboration among statisticians,
traffic engineers, and specialists in various other disciplines.